CAPSTONE PROJECT MACHINE LEARNING

DOMAIN: Housing

Problem statement:

Identifying the correct price of any real estate property has always been a complex process, as it must consider various factors and property features that vary drastically. It becomes difficult to quote the correct price for a particular house because of factors such as the location of the property (coastal, hill station, or city centre), accessibility to emergency services such as hospitals, accessibility to local transport such as airports, metro and bus stations, the builder or type of build, the amenities available with the house, and the house's condition. Additionally, the prices at which neighbouring houses are sold also strongly affect house prices.

To overcome the above-mentioned hurdles, buyers and sellers seek experts to help them estimate the correct price for their property. A real estate agent, who possesses extensive knowledge of the market, gets involved on both the buyer's and the seller's end. Such agents convince buyers to buy a specified property and likewise convince sellers to sell theirs. Under such circumstances, an agent acting as a middleman may quote the buyer an amount with a hidden cost added on top of the seller's asking price; in that case the buyer pays more than the actual property value. Likewise, agents may convince sellers that their property is not up to the mark and that they must reduce the asking price by a certain amount. The reduced amount again goes into the agent's pocket.

Objective:

Take advantage of all the feature variables listed below and use them to analyse and predict house prices.

  1. cid: a notation for a house
  2. dayhours: Date house was sold
  3. price: Price is prediction target
  4. room_bed: Number of Bedrooms/House
  5. room_bath: Number of bathrooms/bedrooms
  6. living_measure: square footage of the home
  7. lot_measure: square footage of the lot
  8. ceil: Total floors (levels) in house
  9. coast: House which has a view to a waterfront
  10. sight: Has been viewed
  11. condition: How good the condition is (Overall)
  12. quality: grade given to the housing unit, based on grading system
  13. ceil_measure: square footage of house apart from basement
  14. basement_measure: square footage of the basement
  15. yr_built: Built Year
  16. yr_renovated: Year when house was renovated
  17. zipcode: zip
  18. lat: Latitude coordinate
  19. long: Longitude coordinate
  20. living_measure15: Living room area in 2015 (implies some renovations); this might or might not have affected the lot size area
  21. lot_measure15: Lot size area in 2015 (implies some renovations)
  22. furnished: Based on the quality of room
  23. total_area: Measure of both living and lot

Importing Required Python Libraries and Modules

Here we import all the libraries and modules needed in a single cell and set some global options for the whole project.
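A minimal sketch of such an import cell. The specific library choices (pandas, numpy, matplotlib) and options shown here are typical assumptions, not a record of the notebook's exact imports:

```python
import warnings

import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # render plots off-screen (a notebook would use inline display)
import matplotlib.pyplot as plt

warnings.filterwarnings("ignore")           # silence library warnings
pd.set_option("display.max_columns", None)  # show every column in previews
```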


Import and Warehouse Data

Here we import the dataset and explore its shape and size.
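A runnable sketch of this step. The real project reads the housing CSV from disk; here a tiny in-memory sample with invented values stands in so the shape/size checks can be demonstrated:

```python
import io
import pandas as pd

# Stand-in for the real file, e.g. pd.read_csv("housing.csv") (filename assumed)
sample_csv = io.StringIO(
    "cid,price,room_bed,room_bath,living_measure\n"
    "1,221900,3,1.0,1180\n"
    "2,538000,3,2.25,2570\n"
)
df = pd.read_csv(sample_csv)

print(df.shape)  # (rows, columns)
print(df.size)   # rows * columns
```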

Note:-

- Exploring the Data


Key Observations:-


Data Cleansing

- Modification of the Attributes


Key Observations:-


- Exploring Datatypes of Each Attribute


Key Observations:-


- Checking for Null Values in the Attributes


Key Observations:-


- Checking for Duplicate Values


Key Observations:-


Data Analysis & Visualisation:

Information about the Features

For further analysis, we first distinguish between the different types of attributes.

1. Qualitative Attributes:-

2. Quantitative Attributes:-

- Statistical Summary of Data

Here we observe some basic statistical details such as percentiles, mean and standard deviation.
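In pandas this summary comes from `DataFrame.describe()`; the frame below is a hypothetical stand-in for the housing data:

```python
import pandas as pd

# Hypothetical numeric columns standing in for the housing attributes
df = pd.DataFrame({"price": [221900, 538000, 180000, 604000],
                   "room_bed": [3, 3, 2, 4]})

summary = df.describe()  # count, mean, std, min, quartiles, max per column
print(summary)
```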


Key Observations:-


- Checking Skewness of the Attributes

Skewness refers to a distortion or asymmetry that deviates from the symmetrical bell curve, or normal distribution, in a set of data.
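For instance, a long right tail (as house prices typically have) produces positive skewness; the series below is an invented illustration:

```python
import pandas as pd

# One large value creates a long right tail -> positive (right) skew
prices = pd.Series([100, 110, 120, 130, 1000])
print(prices.skew())  # > 0 indicates right-skew
```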


Key Observations:-

By observing above table we can say,


- Checking Correlation of the Attributes

“Correlation” is a statistical term describing the degree to which two variables move in coordination with one another.
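Pairwise (Pearson) correlations come from `DataFrame.corr()`; the values below are invented, chosen so that living area and price move together:

```python
import pandas as pd

df = pd.DataFrame({
    "living_measure": [1180, 2570, 770, 1960],
    "price":          [221900, 538000, 180000, 604000],
})
corr = df.corr()  # Pearson correlation matrix, values in [-1, 1]
print(corr.loc["living_measure", "price"])
```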


Key Observations:-


- Univariate Analysis

The objective of univariate analysis is to describe the data, define and summarize it, and analyze the patterns present in it.

Creating Functions for Plotting the Quantitative and Qualitative Attributes for Univariate Analysis.

We will use these functions for easy analysis of individual attributes.
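A sketch of what such helper functions might look like; the exact plots and names in the project are not shown, so these signatures are assumptions:

```python
import matplotlib
matplotlib.use("Agg")  # off-screen rendering
import matplotlib.pyplot as plt
import pandas as pd

def plot_quantitative(series, name):
    """Histogram plus boxplot for one continuous attribute (sketch)."""
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
    ax1.hist(series.dropna(), bins=20)
    ax1.set_title(f"{name} distribution")
    ax2.boxplot(series.dropna())
    ax2.set_title(f"{name} spread / outliers")
    return fig

def plot_qualitative(series, name):
    """Bar chart of category counts for one discrete attribute (sketch)."""
    fig, ax = plt.subplots(figsize=(5, 3))
    series.value_counts().sort_index().plot(kind="bar", ax=ax)
    ax.set_title(f"{name} counts")
    return fig

fig_q = plot_quantitative(pd.Series([221900, 538000, 180000, 604000, 510000]), "price")
fig_c = plot_qualitative(pd.Series([2, 3, 3, 4, 3]), "room_bed")
```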


Key Observations:-


Attribute 1: Price


Key Observations:-


Attribute 2: Living_Measure


Key Observations:-


Attribute 3: Lot_Measure


Key Observations:-


Attribute 4: Ceil_Measure


Key Observations:-


Attribute 5: Lat


Key Observations:-


Attribute 6: Long


Key Observations:-


Attribute 7: Living_measure15


Key Observations:-


Attribute 8: Lot_measure15


Key Observations:-


Attribute 9: Total_Area


Key Observations:-


Attribute 10: Yr_Built


Key Observations:-


Attribute 11: Yr_Renovated


Key Observations:-


Attribute 12: Basement


Key Observations:-


Attribute 13: Coast


Key Observations:-


Attribute 14: Sight


Key Observations:-


Attribute 15: Furnished


Key Observations:-


Attribute 16: Room_bed


Key Observations:-


Attribute 17: Room_bath


Key Observations:-


Attribute 18: Ceil


Key Observations:-


Attribute 19: Condition


Key Observations:-


Attribute 20: Quality


Key Observations:-


- Bivariate Analysis

Creating Functions for Plotting the Quantitative VS Categorical Data for Bivariate Analysis

These functions make the analysis easier.
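One plausible shape for such a helper: a boxplot of a quantitative column grouped by a categorical one (the function name and layout are assumptions):

```python
import matplotlib
matplotlib.use("Agg")  # off-screen rendering
import matplotlib.pyplot as plt
import pandas as pd

def plot_cat_vs_num(df, cat, num):
    """Boxplot of `num` for each level of `cat` (sketch)."""
    groups = [grp[num].values for _, grp in df.groupby(cat)]  # sorted by key
    fig, ax = plt.subplots(figsize=(6, 3))
    ax.boxplot(groups)
    ax.set_xticks(range(1, len(groups) + 1),
                  labels=[str(k) for k in sorted(df[cat].unique())])
    ax.set_xlabel(cat)
    ax.set_ylabel(num)
    return fig

# Invented demo data: price grouped by bedroom count
demo = pd.DataFrame({"room_bed": [2, 2, 3, 3, 4, 4],
                     "price": [180000, 200000, 300000, 320000, 500000, 520000]})
fig = plot_cat_vs_num(demo, "room_bed", "price")
```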

Bivariate Analysis 1: 'room_bed' VS all Quantitative attributes


Key Observations:-


Bivariate Analysis 2: 'ceil' VS all Quantitative attributes


Key Observations:-


Bivariate Analysis 3: 'condition' VS all Quantitative attributes


Key Observations:-


Bivariate Analysis 4: 'quality' VS all Quantitative attributes


Key Observations:-


Bivariate Analysis 5: 'coast' VS all Quantitative attributes


Key Observations:-


Bivariate Analysis 6: 'sight' VS all Quantitative attributes


Key Observations:-


Bivariate Analysis 7: 'furnished' VS all Quantitative attributes


Key Observations:-


- Multivariate Analysis


Key Observations:-


Finding the Hidden Patterns

- Dividing the houses based on Price Range(PR)

Here we divide the house prices into 3 price ranges.
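The exact cut points used in the project are not shown, so this sketch assumes equal-sized tertiles via `pd.qcut`; a fixed-boundary `pd.cut` would work the same way:

```python
import pandas as pd

prices = pd.Series([180000, 221900, 300000, 450000, 538000, 604000])

# Three equal-frequency bins labelled as price ranges (labels assumed)
price_range = pd.qcut(prices, q=3, labels=["Low", "Medium", "High"])
print(price_range.value_counts())
```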


Key Observations:-


- Finding the Top 10 Popular Areas Where Buyers Showed Interest in Buying Properties


Key Observations:-


- Finding the Bedroom Trends


Key Observations:-


- Finding Premium Houses having following requirement:-


Key Observations:-


- Cheapest Houses sold in Coastal areas


Key Observations:-


- Checking the Price Comparison between Renovated and Non-renovated Houses


Key Observations:-


- Finding the Month in which Buyers Bought the Most Houses


Key Observations:-


- Finding relation between Year_built and Price.


Key Observations:-


Data Pre-Processing

- Dropping the Attributes


Key Observations:-


- Outlier Analysis

FLOORING AND CAPPING

In this quantile-based technique, we floor the lower values (e.g. at the 25th percentile) and cap the higher values (e.g. at the 75th percentile). These percentile values are used for the quantile-based flooring and capping.
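A minimal sketch of flooring and capping with `Series.clip`; the 25th/75th percentiles follow the text's example, though the project may have used different cut-offs:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5, 100])  # 100 is an obvious outlier

low, high = s.quantile(0.25), s.quantile(0.75)
capped = s.clip(lower=low, upper=high)  # floor below `low`, cap above `high`
print(capped.tolist())
```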


Key Observations:-


- Data Standardization

Since our continuous attributes are approximately normally distributed, we will apply Z-score standardization.

By standardizing the values, the data distribution will have the following statistics:
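Z-score standardization subtracts each column's mean and divides by its standard deviation, leaving mean ~0 and unit standard deviation. A sketch on an invented column (the project may have used scikit-learn's `StandardScaler`, which is equivalent up to the sample/population std convention):

```python
import pandas as pd

df = pd.DataFrame({"living_measure": [1180.0, 2570.0, 770.0, 1960.0]})

# z = (x - mean) / std, column-wise
standardized = (df - df.mean()) / df.std()
print(standardized.mean().round(6).tolist(), standardized.std().tolist())
```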


Key Observations:-


- Segregation of Predictors and Target Attributes.

We will separate all other attributes from our target attribute 'price' for further processing.
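The standard pandas idiom for this split, on an invented stand-in frame:

```python
import pandas as pd

df = pd.DataFrame({"room_bed": [3, 3, 2],
                   "living_measure": [1180, 2570, 770],
                   "price": [221900, 538000, 180000]})

X = df.drop(columns=["price"])  # predictors
y = df["price"]                 # target
print(X.shape, y.shape)
```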


Key Observations:-


- Splitting the Data into Training set and Test Set

NOTE:- After checking the splitting ratios 70:30, 75:25 and 80:20, we decided to keep the 75:25 ratio, since it gives better accuracy than the other ratios.
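With scikit-learn, a 75:25 split looks like this (the `random_state` is an assumption for reproducibility):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)  # stand-in predictors
y = np.arange(100)                 # stand-in target

# test_size=0.25 gives the 75:25 split the project settled on
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)
print(len(X_train), len(X_test))  # 75 25
```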


Key Observations:-


Modeling

The process of modeling means training a machine learning algorithm to predict the labels from the features, tuning it for the business need, and validating it on holdout data. The output from modeling is a trained model that can be used for inference, making predictions on new data points.
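A minimal sketch of this train-and-compare loop on synthetic data; the two regressors shown are illustrative, while the project evaluated a larger set of models:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data stands in for the housing features
X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

scores = {}
for name, model in [("LinearRegression", LinearRegression()),
                    ("GradientBoosting", GradientBoostingRegressor(random_state=0))]:
    model.fit(X_tr, y_tr)
    scores[name] = model.score(X_te, y_te)  # R^2 on the holdout set
print(scores)
```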

- Model Training and Getting Accuracies


Key Observations:-


- Model Selection

- Feature Selection

NOTE:- Here we are choosing top 6 regressors based on model accuracy.

1. Checking Feature Importance


Key Observations:-


2. Checking the accuracies of the model based on different threshold values for the top importance scores.


Key Observations:-


3. Setting the Threshold value and Selecting the Features.
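The threshold-based selection described in the steps above can be sketched with scikit-learn's `SelectFromModel`; the 0.05 threshold, the single fitted model, and the synthetic data here are illustrative assumptions, not the project's actual values:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.feature_selection import SelectFromModel

# Synthetic data: only 3 of 8 features carry signal
X, y = make_regression(n_samples=200, n_features=8, n_informative=3,
                       noise=5, random_state=0)
model = GradientBoostingRegressor(random_state=0).fit(X, y)

# Keep only features whose importance score clears the threshold
selector = SelectFromModel(model, threshold=0.05, prefit=True)
X_reduced = selector.transform(X)
print(X.shape[1], "->", X_reduced.shape[1])
```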


Key Observations:-


- Model Tuning and Feature Reduction

Parameter Tuning

Parameter tuning is done to improve the accuracy of the model by adjusting its parameter values.

Feature Reduction

Feature reduction is done to reduce the data dimension and to decrease the model's computation time. By this we can improve our model's performance.
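One common way to tune parameters is a cross-validated grid search; the small grid and synthetic data below are assumptions standing in for the project's actual search space:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=120, n_features=4, noise=5, random_state=0)

# Illustrative grid: 2 x 2 parameter combinations, 3-fold CV each
param_grid = {"n_estimators": [50, 100], "max_depth": [2, 3]}
search = GridSearchCV(GradientBoostingRegressor(random_state=0),
                      param_grid, cv=3, scoring="r2")
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```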


Key Observations:-


- Checking the Performance of Final Model


Key Observations:-


Pickling the Selected Model

Pickle is the standard way of serializing objects in Python.
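A round-trip sketch with a small stand-in model (the project pickles its tuned GradientBoostingRegressor instead); a serialized model restores with identical coefficients, so its predictions match exactly:

```python
import pickle
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=50, n_features=3, noise=1, random_state=0)
model = LinearRegression().fit(X, y)

blob = pickle.dumps(model)     # serialize; pickle.dump(model, f) writes a .pkl file
restored = pickle.loads(blob)  # deserialize the trained model
print((restored.predict(X[:1]) == model.predict(X[:1])).all())  # True
```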


Key Observations:-


Predicting New Data

NOTE:- 'raw_data' below is used for displaying user input.
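A sketch of the inference step. The trained model, feature list, and input values here are hypothetical; the real project loads the pickled GradientBoostingRegressor and feeds it the 13 selected features:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical tiny training set and model standing in for the pickled one
train = pd.DataFrame({"room_bed": [2, 3, 4],
                      "living_measure": [770, 1180, 2570],
                      "price": [180000, 221900, 538000]})
model = LinearRegression().fit(train[["room_bed", "living_measure"]], train["price"])

# raw_data holds the user's input row(s) for display alongside the prediction
raw_data = pd.DataFrame({"room_bed": [3], "living_measure": [1500]})
prediction = model.predict(raw_data)
print(raw_data.assign(predicted_price=prediction))
```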


Key Observations:-


Conclusion

In the beginning we imported our housing data file and explored it. We then performed data cleansing, in which we made some modifications to the 'dayhours' attribute. Further, we carried out statistical analysis on the data to find insights regarding the mean, standard deviation, etc. In the EDA part we found many outliers in the continuous variables; these were handled by the flooring and capping method during data pre-processing. We also uncovered some hidden patterns to gain better insight into the data.

After segregating the data into predictors and target attributes, we split it into training and test sets in a 75:25 ratio. We then applied all the models suitable for our data. By observing the results from all the models, we chose GradientBoostingRegressor as our final model. We further performed feature selection by considering the top 6 regressors with the highest accuracies, and then did model tuning along with feature reduction. In the end we had 13 input features and a target, which we used to check the model's performance.

The final GradientBoostingRegressor shows 89.32% test accuracy with an RMSE of 0.32. We pickled this model into a .pkl file and then rechecked it. Using the pickled model, we predicted price values for new data, and the model performed quite well.

Closing Sentence:- The price predictions made by our model will help buyers and sellers find the approximate price of a house.